ABSTRACT

Data mining is an important technique for arriving at better business solutions. A general
survey of data mining techniques, methods, tools and challenges across all domains is of utmost
importance and much in demand, and applying machine learning concepts and techniques in the
medical field is equally essential in the present scenario. This paper presents the following:
various data mining techniques, the merits and demerits of data mining, the open source data
mining tools available, and the domains or industries in which DM is applied.

INTRODUCTION

Conducting a survey and publishing its results is the
main expected outcome of this research report. The DM
techniques are presented, along with the extent to which
they have contributed to the healthcare field [14].


II. TYPES OF DM TECHNIQUES: 

1. CLASSIFICATION: 

Classification is an important DM technique and the most
widely used one. It predicts a certain output based on a set
of pre-classified examples. Learning algorithms can be
broadly divided into supervised and unsupervised algorithms;
classification, which learns from labelled examples, is
supervised. The major classification methods are decision
tree induction, Bayesian networks, linear programming,
neural networks and fuzzy logic techniques
[13][26][27][28][29][30].

2. CLUSTERING: 

Clustering is an important DM technique. It groups similar
objects together and separates dissimilar ones. There are a
number of clustering models, which can be used for different
applications. Clustering can also be used as a pre-processing
approach for attribute subset selection and classification [4].
It is mainly used for pattern recognition, machine learning
and information retrieval.

2.1 K-MEANS CLUSTERING: 

K-means is a very simple clustering approach that is low in
complexity and also efficient. It requires the number of
clusters to be specified in advance. It has difficulty
handling categorical attributes, and it cannot detect
clusters with non-convex shapes. The outcome varies in the
presence of outliers.
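
The need to fix the number of clusters beforehand can be seen in a
minimal sketch of the k-means loop. The sample points, the function
name kmeans and the choice of the first k points as starting centroids
are assumptions made for illustration only; they are not taken from the
surveyed literature.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on numeric points; k must be supplied by the caller."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Because each centroid is a mean, a single distant point pulls it away
from the rest of its cluster, which is one way to see why outliers
change the outcome.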

2.2 HIERARCHICAL CLUSTERING: 

Hierarchical clustering is easy to implement and has good
visualization capability. It is not necessary to specify the
number of clusters in advance. It has cubic time complexity
in many cases, so it is slower. Once a merge or split
decision is made it cannot be undone. It does not work
properly in the presence of noise, and it is not scalable.

2.3 DENSITY BASED CLUSTERING: 

There is no need to specify the number of clusters in
advance. It handles clusters of arbitrary shape easily and
works well in the presence of noise. It does not handle data
points with varying densities well. The results depend on
the distance measure used.

3. PREDICTION:

Prediction is an important DM technique. As its name
implies, it discovers relationships among independent
variables and relationships between dependent and
independent variables. Unfortunately, many real-world
problems are not simple prediction problems. For instance,
stock prices and sales volumes are difficult to predict;
therefore we need more complex techniques such as logistic
regression, decision trees and neural networks [4].
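
As a point of reference for the more complex techniques named above,
the simplest relationship between one independent and one dependent
variable is a straight line fitted by ordinary least squares. The
sketch below is an illustrative assumption; the function name fit_line
and the data are invented and do not come from the cited work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(slope * 6 + intercept)  # prediction for an unseen x
```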

4. ASSOCIATION: 

Association is an important DM technique and one of the
best known. In association, a pattern is discovered based on
the relationship of a particular item to other items in the
same transaction. For example, the association technique is
used in heart disease prediction, as it tells us the
relationships among the different attributes used for
analysis and sorts out the patients with all the risk
factors that are required for prediction of the disease [4].
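
Association rules are commonly scored by support and confidence. The
sketch below computes both for a rule of the form "antecedent implies
consequent". The patient records and attribute names are invented for
illustration and are not data from the cited study.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for r in records if itemset <= r) / len(records)

def confidence(records, antecedent, consequent):
    """Of the records containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)

# Hypothetical records: each set lists the attributes present for one patient.
records = [
    {"smoker", "high_bp", "heart_disease"},
    {"smoker", "heart_disease"},
    {"high_bp"},
    {"smoker", "high_bp", "heart_disease"},
    {"smoker"},
]

rule_support = support(records, {"smoker", "heart_disease"})
rule_confidence = confidence(records, {"smoker"}, {"heart_disease"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values exceed user-chosen
thresholds.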

5. NEURAL NETWORK: 

Neural networks are an important DM technique. Neural
network models have been studied for many years in the hope
of achieving human-like performance in several fields. Just
as the human brain consists of billions of neurons
interconnected by synapses, a neural network is a set of
connected input/output units in which each connection has a
weight associated with it. The network learns in the
learning phase by adjusting the weights so as to be able to
predict the correct class label of the input. Neural
networks are among the best at identifying patterns or
trends in data and are well suited to prediction or
forecasting needs. Due to their performance, neural networks
have been widely used in experiments and adopted for
critical biomedical classification and clustering problems.
Merits: they easily identify complex relationships between
dependent and independent variables and are able to handle
noisy data. Demerits: they are prone to local minima and
over-fitting, the processing of an ANN is difficult to
interpret, and large networks require high processing
time [12].
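
The idea of learning by adjusting connection weights can be shown with
a single artificial neuron (a perceptron), the smallest possible
network. This is a hedged sketch: the learning rate, the number of
epochs and the logical AND training set are assumptions chosen for
illustration, and practical networks use many units and other update
rules.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Adjust weights whenever the predicted label differs from the target."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Move each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training set for logical AND (linearly separable, so training converges).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(weights, bias, inputs) for inputs, _ in samples])
```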

6. DECISION TREE(DT): 

Decision trees are an important DM technique. Decision tree
learning uses a decision tree as a predictive model which
maps observations about an item to conclusions about the
item’s target value. In decision analysis, a decision tree
can be used to visually and explicitly represent decisions
and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting
classification tree can be an input for decision making. The
goal is to create a model that predicts the value of a
target variable based on several input variables. Merits: no
domain knowledge is needed to construct a decision tree. It
minimizes the ambiguity of complicated decisions and assigns
exact values to the outcomes of various actions. It can
easily process high-dimensional data, it is easy to
interpret, and it handles both numerical and categorical
data. Demerits: it is restricted to one output attribute and
it generates categorical output. It is an unstable
classifier, i.e. the performance of the classifier depends
on the type of dataset. If the dataset is numeric, it
generates a complex decision tree [9].
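
One reason numeric data lead to larger trees is that every gap between
two observed values is a candidate split point. The sketch below
evaluates those candidates with Gini impurity, which is one common
splitting criterion; the source does not state which criterion is
meant, so this choice, the function names and the data are assumptions
for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return (threshold, weighted impurity) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Invented numeric attribute with a categorical target.
ages = [25, 32, 47, 51, 62]
risk = ["low", "low", "high", "high", "high"]
threshold, impurity = best_split(ages, risk)
print(threshold, impurity)
```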

7. SUPPORT VECTOR MACHINE (SVM): 

Support vector machines are an important DM technique. A
support vector machine is an algorithm that attempts to find
a linear separator (hyperplane) between the data points of
two classes in multidimensional space. SVMs are well suited
to dealing with interactions among features and redundant
features. Merits: often better accuracy compared with other
classifiers, less prone to over-fitting than other methods,
and able to handle complex nonlinear data points easily.
Demerits: it is computationally expensive. The main problem
is the selection of the right kernel function, since
different kernel functions show different results for every
dataset. Compared with other methods, the training process
takes more time. SVM was designed to solve binary
classification problems. It solves multi-class problems by
breaking them into two-class problems, using schemes such as
one-against-one and one-against-all [9].
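
The two decomposition schemes differ in how many binary problems they
create. The following sketch only enumerates the binary problems for a
given list of class labels; it does not train any SVM, and the class
names are invented for illustration.

```python
from itertools import combinations

def one_against_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

def one_against_all(classes):
    """One binary problem per class: that class versus all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]

classes = ["A", "B", "C", "D"]  # hypothetical class labels
pairs = one_against_one(classes)
versus_rest = one_against_all(classes)
print(len(pairs), pairs)
print(len(versus_rest), versus_rest)
```

For k classes, one-against-one needs k(k-1)/2 binary classifiers while
one-against-all needs k, which adds to the training cost noted above.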

8. GENETIC ALGORITHMS (GAS) AND EVOLUTIONARY PROGRAMMING (EP):

Genetic algorithms and evolutionary programming are an
important DM technique. They are algorithmic optimization
strategies inspired by the principles observed in natural
evolution. From a collection of potential problem solutions
that compete with each other, the best solutions are
selected and combined with each other. In doing so, one
expects that the overall goodness of the solution set will
become better and better, similar to the process of
evolution of a population of organisms. Genetic algorithms
and evolutionary programming are used in data mining to
formulate hypotheses about dependencies between variables,
in the form of association rules or some other internal
formalism [12].
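
The select-and-combine cycle described above can be sketched on a toy
problem. Here a solution is a bit string and its goodness is the number
of 1 bits; this objective, the population size, the mutation rate and
the fixed random seed are all assumptions chosen for illustration and
are unrelated to rule discovery in a real data set.

```python
import random

def evolve(length=12, population_size=20, generations=30, seed=0):
    """Toy genetic algorithm that maximizes the number of 1 bits."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    fitness = sum               # goodness of a solution = count of 1 bits
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: the fitter half of the population survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = a[:cut] + b[cut:]        # one-point crossover
            if rng.random() < 0.1:           # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
        history.append(max(fitness(individual) for individual in population))
    return max(population, key=fitness), history

best, history = evolve()
print(best)
print(history)
```

Because the best solutions are carried over unchanged, the best
goodness value recorded in history never decreases from one generation
to the next.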

9. FUZZY SETS (FS): 

Fuzzy sets are an important DM technique. They form a key
methodology for representing and processing uncertainty.
Uncertainty arises in many forms in today’s databases:
imprecision, non-specificity, inconsistency, vagueness, etc.
Fuzzy sets exploit uncertainty in an attempt to make system
complexity manageable. As such, fuzzy sets constitute a
powerful approach not only for dealing with incomplete,
noisy or imprecise data, but also for developing uncertain
models of the data that provide smarter and smoother
performance than traditional systems [12].
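
A fuzzy set replaces yes/no membership with a degree of membership
between 0 and 1. The sketch below uses a triangular membership
function, one common shape; the fuzzy set "middle-aged" and its
breakpoints are invented for illustration.

```python
def triangular(x, low, peak, high):
    """Degree of membership in [0, 1] for a triangular fuzzy set.
    Assumes low < peak < high."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)   # rising edge
    return (high - x) / (high - peak)     # falling edge

# Hypothetical fuzzy set "middle-aged" with breakpoints 30, 45 and 60.
for age in (25, 37.5, 45, 52.5, 70):
    print(age, triangular(age, low=30, peak=45, high=60))
```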

10. ROUGH SETS (RS):

Rough sets are an important DM technique. A rough set is
determined by a lower and an upper bound of a set. Every
member of the lower bound is a certain member of the set.
Every non-member of the upper bound is a certain non-member
of the set. The upper bound of a rough set is the union of
the lower bound and the so-called boundary region.
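
The relation between the lower bound, the upper bound and the boundary
region can be checked on a small example. The sketch assumes that the
objects have already been grouped into blocks of mutually
indistinguishable objects; the blocks, the target set and the function
name are invented for illustration.

```python
def approximations(blocks, target):
    """Return (lower bound, upper bound, boundary region) of target,
    given blocks of objects that cannot be told apart."""
    target = set(target)
    lower, upper = set(), set()
    for block in blocks:
        block = set(block)
        if block <= target:   # block lies entirely inside the set
            lower |= block
        if block & target:    # block overlaps the set at all
            upper |= block
    return lower, upper, upper - lower

# Hypothetical objects 1..6 in three indistinguishable blocks.
blocks = [{1, 2}, {3, 4}, {5, 6}]
target = {1, 2, 3}
lower, upper, boundary = approximations(blocks, target)
print(lower, upper, boundary)
```

Objects 5 and 6 fall outside the upper bound and are therefore certain
non-members, while objects 3 and 4 form the boundary region.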

11. K-NEAREST NEIGHBOR (KNN): 

KNN is an important DM technique. It is very easy to
implement. Training is fast. It requires large storage
space. It is very sensitive to noise. Testing is slow [14].
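
These properties follow from how the method works: training only
stores the examples, while every prediction compares the query with all
stored examples. A minimal sketch is given below; the training points,
the labels and the choice of k = 3 with Euclidean distance are
assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k nearest stored
    examples (Euclidean distance)."""
    # "Training" is just keeping the list; all the work happens here.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
print(knn_predict(training, (2, 2)))
print(knn_predict(training, (7, 9)))
```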

12. BAYESIAN BELIEF NETWORK (BBN) 

BBN is an important DM technique. It makes the computation
process easier and offers better speed and accuracy for huge
datasets. It does not give accurate results in some cases
where there are dependencies among variables [3].


III. DATA MINING MERITS AND DEMERITS 
MERITS: 

• It predicts future trends.

• It helps in decision making.

• It helps to improve company revenue.

• It is mainly used in market analysis.

• It is effectively utilized in fraud detection.

• DM techniques are applied by healthcare insurers to
detect fraud and abuse.

• DM helps physicians to identify effective treatments
and best practices through healthcare software.

• Using data mining, it is possible to speed up work
on large data sets.

• DM facilitates quicker report generation and faster
analysis, which increases operational efficiency
and reduces operating cost.

• Data mining can extract predictive information from
large databases, which is very important.

DEMERITS: 

• Heterogeneity of data, together with its volume and
complexity, can lead to unnecessary mathematical
categorization.

• Ethical, legal and social issues must be considered.

• Data ownership and lawsuits must be dealt with.

• Privacy and security of human data, and administrative
issues, especially for medical data.

• Other general security issues.

• Misuse of information, or inaccurate information.


IV. OPEN SOURCE DATA MINING TOOLS AVAILABLE: 


Table 1: Open Source Tools and their strengths [2][7].


The comparison of the DM tools KNIME and RapidMiner is presented in Table 1.


Sl No | Attributes | KNIME | RapidMiner
------+------------+-------+-----------
1 | Partitioning of dataset to training and testing sets | Only limited partitioning abilities | Less/limited partitioning abilities
2 | Type | Enterprise Reporting, Business Intelligence | Statistical Analysis, Data Mining, Predictive Analytics, Clustering
3 | Scaling | Facility available | Available with this facility
4 | Language and OS | Linux, OS X, Windows | Cross platform, language independent
5 | Selection | Wrapper methods | Available with this facility
6 | Parameter optimization of machine learning/statistical methods | Does not have automatic facility | Available with this facility
7 | Model validation using cross-validation and/or independent validation set | Only a few error measurement methods | Available with this facility
8 | Advantages | Molecular analysis, mass spectrometry | Visualization, statistical analysis, attribute selection, outlier detection and parameter optimization are all possible
9 | Limitations | Limited error measurements | Requires sound knowledge of dealing with databases

Table 2: DM - Open Source Tools and their strengths [2], [7]


Sl No | Attributes | Weka | Orange
------+------------+------+-------
1 | Partitioning of dataset to training and testing sets | Less partitioning abilities | Limited partitioning abilities
2 | Type | Machine Learning | Machine Learning, Data Mining
3 | Scaling | Not possible | Not possible
4 | Language and Operating System | Cross platform, mainly using Java | Cross platform, mainly using Python, C++, C
5 | Selection | Partially possible | Not possible
6 | Parameter optimization of machine learning/statistical methods | Not automated | Automatic facility not possible
7 | Model validation using cross-validation and/or independent validation set | Partially possible | Partially possible
8 | Advantages | Simple to use | Better debugger, shortest scripts possible
9 | Limitations | Poor documentation | Limited reporting capabilities


The comparison of the DM tools Weka and Orange is presented in Table 2.


V. DATA MINING TECHNIQUES APPLIED DOMAINS:

Data mining is an interdisciplinary field with wide and
diverse applications. There are nontrivial gaps between data
mining principles and domain-specific applications. A few
application domains of data mining are listed below:

• Healthcare

• Finance

• Retail industry

• Telecommunication

• Text Mining

• Web Mining

• Higher Education, etc.

Tremendous results have been reported in all the above
fields through the effective utilization of DM [30].


VI. CONCLUSION: 

In this paper we presented various data mining techniques
that have been employed for medical data mining. Data mining
techniques have high utility in medical data mining, as
there is voluminous data in this industry. Due to the rapid
growth of medical data, it has become indispensable to use
data mining techniques to aid decision support and
prediction systems in the field of health care. This paper
has provided a summary of data mining techniques used across
all the domains.